Structured Query Language With PostgreSQL
Structured Query Language (SQL) is a standard language described by ISO/IEC 9075 for creating and working with databases stored in a set of tables. The actual implementations through database management systems use this standard language and usually add their own extensions for additional specifications. The most common implementations include PostgreSQL, MySQL, Microsoft SQL Server, Oracle Database, and SQLite, where PostgreSQL will be used as a reference, since it is a popular, versatile, and open source database (although most core functionality is shared). These notes rely on the ideas and learnings from the respective package documentations and "SQL For Data Analysis: Advanced Techniques For Transforming Data Into Insights", 1st Edition, by Cathy Tanimura in 2021.
Background To Databases
A database is an efficient and organized collection of information from which data can be extracted. The data is often stored in tables with rows and columns, although there are various other designs, schemes, and formats for representation. It is also possible for multiple users to conveniently access the data for consistency, while providing the ability to assign permissions and authority for security. A distinction can be made between 1-tier architecture, where the database and application client are running on the same system; 2-tier architecture, where the database can be accessed from multiple application clients running on different systems; and 3-tier architecture, where the database is accessed by a server which then interfaces with application clients running on different systems (increases security, scalability, liability, and development). The most ubiquitous types of databases include navigation, hierarchical, network, relational, and non-relational. Depending on the complexity of the database, it may be stored on a single local centralized file system or cluster of distributed computers in the cloud.
The early development of databases began with navigation, hierarchical, and network databases, as general-purpose systems to procedurally access data. These approaches offered application clients the ability to navigate around a coupled set of data which formed into a large network with sequential paths following tree-like orders (always following references from other objects). This network consisted of the linked records with collections of fields which then contained single values. There are some differences in the specifications, but, there will generally be rules which mandate the relationships from parent records to child records. Although simple, this often leads to inflexibility and constraints, such as needing to traverse from the root node to retrieve data (lack of arbitrary navigational routes). There are some implementations which are still available, but these databases are rarely used for production operations.
A relational database is a structured mechanism used to store data in a series of tables, where the tables are connected or have a relationship with each other in the style of a ledger. Conventionally, these tables can be seen to represent various types of entities with rows representing instances of individual records and columns describing the attributes of those records. A primary key is a unique identifier used to distinguish each record within an individual table. This primary key can then be shared as a foreign key and function as a common identifier between tables for related records. Thus, it is possible to extract data from multiple tables through cross-referencing directly in relation to the relevant keys of the records. As a result, the management of the database is systematic and methodical with a consistent structure and facilities for data normalization, while also allowing for efficient backup and recovery.
A non-relational database can be used to overcome the inconvenience of object-relational impedance mismatch when using domain-driven object models. Examples of common non-relational databases include key-value databases, where unique keys are tied to associated values; column-store databases, where data is only stored in columns and optimized for performance; graph databases, where the entities in a database are graphically illustrated with their linked connections; and document-store databases, where data is stored in documents grouped into collections which can be retrieved. The advantages of non-relational databases are improved flexibility, high scalability (as horizontal scaling through more workers, rather than vertical scaling through a more computationally powerful system), and cost-effectiveness. These are sometimes referred to as NoSQL or non-SQL databases.
SQL And PostgreSQL
datastore?In general, a database consists of schemas which provide organization and structure by describing the way in which the data in a schema may relate to other components and elements of the database. Within a schema, the objects most commonly used are tables, views, and functions. As mentioned, tables contain ...fields... and have indexes (usually for the primary key or commonly used columns), which is a special kind of data structure allowing data to be retrieved more efficiently. Views are virtual tables which allow data to be presented in an alternate format without modifying or altering the underlying tables (does not actually store data, so they could be thought of as stored queries). In creating a view, it is possible to employ a set of operations (based on the mathematical system of relational calculus) to define, filter, aggregate, or transform the data extracted from the underlying tables. Functions allow commonly used sets of calculations or procedures to be stored and easily referenced in queries. A database will often have an associated data dictionary, as a document or repository which gives a clear description of the attributes of the data.
SQL was designed to offer a set-based and declarative mechanism for interacting with a database management system with a scope extending to data query, data definition, data control, and data manipulation. It can be sub-divided into several language elements, including clauses, expressions, predicates, queries, and statements. For interest, SQL was initially developed at IBM by Donald Chamberlin and Raymond Boyce after ...learning about... the relational model from Edgar Codd in the early 1970s (originally called Structured English QUEry Language (SEQUEL)). SQL became a standard of the American National Standards Institute (ANSI) in 1986 and International Organization For Standardization (ISO) in 1987. As mentioned, actual implementations through database management systems use this standard language and usually add their own extensions for additional specifications. It should be noted that the keywords in commands are not case-sensitive (and whitespace is insignificant), but it is convention to use uppercase to make it easier to distinguish between keywords, database, table, and column names (database, table, and column names are case-sensitive).
The mentioned sub-languages of SQL for communicating a database include data query, data definition, data control, and data manipulation. The data query is associated with extracting data from a database through indexing and selection from multiple schemas, tables, and views (common commands include SELECT
, FROM
, and WHERE
). The data definition is associated with creating, modifying, and deleting tables, views, and other objects in the database, where the structure is defined without affecting the content (common commands include CREATE
, ALTER
, and DROP
). The data control is associated with accessing the database by assigning or removing permissions for users (common commands include GRANT
and REVOKE
). The data manipulation is associated with adding, updating, and deleting the content of the data (common commands include INSERT
, UPDATE
, and DELETE
).
PostgreSQL (originally called POSTGRES) is an object-relational database management system. It features transactions with atomicity consistency isolation durability properties (retains integrity while performing transactions), complex queries, automatically updatable views, materialized views, triggers, foreign keys, and stored procedures. It is designed to emphasize extensibility and allow for the modification and creation of new data types, functions, operators, index methods, and procedural languages (including direct support for Perl, Tcl, and Python with external support for many other languages). Since its initial release in 1996-07, it has established a reputation for reliability, robustness, security, and performance. The official documentation is a useful resource for reference of information and tutorials.
Each of the columns in a table of a database will have an associated data type for the ...associated... rows. The common and supported data types in PostgreSQL include string as char
(fixed length), varchar
(variable length), text
(longer variable length); numerical as int
(with versions for precision), float
(with versions for precision), and decimal
(specification of decimal places); logical as boolean
(true or false); and datetime as date
(year, month, and day), time
(hours, minutes, and seconds), and timestamp
(date and time). Other useful data types include json
, xml
, and bytea
... .
Initial Installation
There are no unusual prerequisites to install PostgreSQL, apart from the minimum hardware requirements for the system. PostgreSQL can usually be installed through a package manager, as conventionally performed through the native package manager of a Linux distribution (although this version may not be officially maintained). Alternatively, the official repository link can be added to the native package manager to fetch the official version of PostgreSQL through manual configuration. ...Once installed, the PostgreSQL server is started automatically and enabled by default to start at system reboot. The status of the PostgreSQL server can be checked with systemd commands...
~ $ sudo sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list' ~ $ wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
~ $ sudo apt install postgresql
Creating And Deleting Databases
A database can be created with an assigned name. ... and requires the create privilege on the database. Additional options can be supplied for the creation of the database. These options include specifying the database character set (default as collate), .
It is also possible to drop all the tables in a specific database and delete the database completely. This should be used with caution and requires a drop privilege on the database. When deleting a database, the database directory is removed along with files associated with the database which were created by the implementation during operation (including .BAK, .DAT, .HSH, .MRG, .MYI, .cfg, .db, .ibd, and .ndb). This will not remove temporary tables which were created in the database beforehand, as these will only be automatically removed when the session which created them ends.
CREATE {DATABASE | SCHEMA} [IF NOT EXISTS] database_name [create_option] ... create_option: [DEFAULT] { CHARACTER SET [=] charset_name | COLLATE [=] collation_name | ENCRYPTION [=] {'Y' | 'N'} }
DROP {DATABASE | SCHEMA} [IF EXISTS] database_name
Database privileges...
...Querying...
The SELECT
clause determines which columns, specified as expressions, will be returned by the query. These expressions may also include manipulations, such as aggregating data, ...choosing... different cases, or converting data types. The FROM
clause determines the sources from which the expressions in the SELECT
clause are derived. These sources may be tables, views, or sub-queries (if multiple sources are specified, it is also necessary to include a JOIN
operation). An alias can also be assigned to the sources for convenient referencing in other clauses in the query. The WHERE
clause specifies restrictions or filters which are used to exclude rows from the results. When exploring collections of large data, the LIMIT
clause can be used to restrict the number of rows included in the results (synonymous with TOP
for some implementations of database management systems).
With regard to JOIN
operations, an INNER JOIN
will return all of the rows which match in both sources and OUTER JOIN
will return all of the rows in both tables regardless of if they match (with NULL
if for keys of unmatched rows). This operation can also be specified to have a type as LEFT JOIN
to return all of the rows from the first source but only matching rows from the second source or RIGHT JOIN
to return all of the rows from the second source but only matching rows from the first source. For ..., a CROSS JOIN
... . A Cartesian join will result when each row in a source matches multiple rows in the other source (although this should generally be avoided).
...Manipulating...
Aggregation. When aggregating data, it is also necessary to include a GROUP BY
clause to specify ... (can use columns or positional notation for referencing).